========================================================
This report looks at the effect of 11 variables on white wine quality in a data set of almost 4900 wines. It was produced as part of the Udacity Data Analyst Nanodegree program. The data is taken from:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
The data set contains 13 columns.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
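The first column, ‘X’, appears to be a row index and can be ignored in the analysis. That leaves the following independent variables for analysis: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
There is one dependent variable: quality.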
Quality is a categorical value and will need to be converted to a factor for analysis.
wine$quality <- as.factor(wine$quality)
That leaves 11 variables to explore that may influence the quality of a wine.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Quality is roughly bell-shaped, and the median rating of 6 is also the most common one (2198 of 4898 wines, about 45%). The best and worst wines are barely represented: only 5 wines are rated 9 and 20 wines are rated 3. Either there’s a bias in how the data was collected, leading to extreme values being excluded, or it’s hard for most wines to stand out as either great or terrible.
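Counts like the ones above can be tabulated once quality is a factor. The following is a minimal sketch that uses a short made-up vector of ratings in place of `wine$quality`; the vector and the helper names `quality.f`, `counts`, `props`, and `modal.rating` are assumptions for illustration only.

```r
# Toy ratings standing in for wine$quality (made-up values, not the real data)
quality <- c(6, 5, 6, 7, 5, 6, 8, 4, 6, 5)

# Convert to a factor so R treats the ratings as categories
quality.f <- as.factor(quality)

# Count the wines at each rating and express the counts as proportions
counts <- table(quality.f)
props <- prop.table(counts)

print(counts)
print(round(props, 2))

# The most common rating in this toy vector
modal.rating <- names(counts)[which.max(counts)]
print(modal.rating)
```

Using `prop.table()` alongside `table()` makes it easy to see what share of the wines sits at each rating rather than only the raw counts.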
The three acidity measures roughly fit normal distributions with long tails to the right. A few wines in the data set have high levels of acetic (measured by volatile acidity) or citric acid.
A histogram of the pH doesn’t reflect the long tail seen in the three acid plots, however. pH mostly has a normal distribution with a few levels overrepresented.
Sugar and chlorides are skewed to the right: most values sit at the low end, with long rightward tails.
The three sulfur measurements show a similar rightward skew, but with most of the data forming a roughly normal distribution and then a small number of samples stretching to the right. Sulphates has interesting gaps every 0.1 g/L.
Density shows the same rightward skew as previous measurements.
Alcohol has an interesting pattern similar to sulphates, but with gaps in the data at more frequent intervals. Certain alcohol levels have either no or few samples. Maybe this is a result of how alcohol levels are measured and rounded to the nearest value.
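The rounding explanation can be illustrated with simulated data. This is only a sketch of the hypothesis, not evidence about how the wines were actually measured: the simulated `alcohol` values, the 0.1 rounding step, and the 0.05 bin width are all assumptions chosen for the example.

```r
set.seed(3)

# Simulated alcohol readings, rounded to one decimal place (an assumed resolution)
alcohol <- round(rnorm(2000, mean = 10.5, sd = 1.2), 1)

# Bins 0.05 wide, centred on multiples of 0.05, so they are narrower than
# the 0.1 measurement resolution
lo <- floor(min(alcohol))
hi <- ceiling(max(alcohol))
breaks <- (seq(lo * 20, hi * 20 + 1) - 0.5) / 20

# Compute the histogram counts without drawing the plot
h <- hist(alcohol, breaks = breaks, plot = FALSE)

# Share of bins that contain no observations at all
empty.share <- mean(h$counts == 0)
print(empty.share)
```

When the bins are narrower than the measurement resolution, roughly every other bin is empty, which produces the same kind of gap pattern described above.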
A log transform brings the acidity measurements closer to normal distributions and reveals the same staccato pattern present in alcohol and sulphates.
Log transforming citric acid seems to reverse the skew and result in a long leftward tail instead of a rightward tail.
A log transform of residual sugar reveals a bimodal distribution.
Chlorides and free sulfur dioxide have their distributions pulled more towards the center by a log transform.
A log transform of density doesn’t change the distribution much. The long rightward tail is most likely caused by outliers rather than the distribution of the data.
wine$fixed.acid.log <- log10(wine$fixed.acidity)
wine$volatile.acid.log <- log10(wine$volatile.acidity)
wine$chlorides.log <- log10(wine$chlorides)
wine$free.sulfur.dioxide.log <- log10(wine$free.sulfur.dioxide)
wine$sugar.log <- log10(wine$residual.sugar)
These log transformations either give the variables a distribution closer to normal or, in the case of residual sugar, reveal a bimodal distribution.
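The effect of a log transform on a right-skewed variable can be checked numerically as well as by eye. The sketch below uses a synthetic log-normal variable rather than the wine data, and the `skewness()` helper is a hypothetical function defined here (base R does not provide one).

```r
set.seed(42)

# Synthetic right-skewed variable (log-normal); a stand-in, not the wine data
x <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew.raw <- skewness(x)         # clearly positive: long tail to the right
skew.log <- skewness(log10(x))  # close to zero: roughly symmetric

print(c(raw = skew.raw, log10 = skew.log))
```

A positive skewness means the long tail points to the right. A value near zero after the transform is consistent with the histogram looking closer to normal.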
There are 4898 observations of 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. All of the variables are continuous except for quality, which is categorical with observed levels from 3 (worst quality) to 9 (best quality) in whole-number steps.
Most wines (about 93%) are of quality 5, 6, or 7. Many of the variables are skewed to the right, with long rightward tails. Residual sugar has a bimodal distribution once log transformed. Alcohol measurements show a staccato pattern of many measurements followed by few or none.
Quality is the dependent variable in the data set. The other variables presumably affect the rating that a wine taster gives. I’m curious which variables best correlate with quality.
I would guess variables involving acidity, chlorides, and sulfur would affect the taste of wine and influence a taster’s rating.
I created log transformed variables of fixed acidity, volatile acidity, chlorides, free sulfur dioxide, and residual sugar based on histograms that demonstrated those variables took on normal distributions when log transformed (or in the case of sugar had a bimodal distribution). I also converted quality into a factor so R will treat it as a categorical variable.
Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, and density all showed a strong rightward skew, with the bulk of the data at low values and a long tail to the right. I performed log transformations on these variables, which resulted in fixed acidity, volatile acidity, chlorides, and free sulfur dioxide forming distributions closer to normal. Citric acid’s distribution inverted, obtaining a leftward tail as opposed to a rightward tail. Residual sugar appears to have a bimodal distribution when log transformed. The log transformation did not affect the density distribution, suggesting that the long rightward tail is the result of outliers and not the bulk of the data.
There are high positive correlations between residual sugar and density, total sulfur dioxide and density, free sulfur dioxide and total sulfur dioxide, and residual sugar and total sulfur dioxide.
There are high negative correlations between alcohol and density, alcohol and residual sugar, alcohol and chlorides, alcohol and total sulfur dioxide, and pH and fixed acidity.
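Pairwise correlations like these can be read off a correlation matrix. A minimal sketch with synthetic data follows; the data frame `toy` and the coefficients used to generate it are made up so that sugar, alcohol, and density relate in the same directions as described above, and they are not estimates from the wine data.

```r
set.seed(1)
n <- 200

# Synthetic variables with built-in relationships (assumed, for illustration)
sugar <- runif(n, 1, 20)
alcohol <- 13 - 0.15 * sugar + rnorm(n, sd = 0.8)
density <- 0.99 + 0.0004 * sugar - 0.001 * (alcohol - 10) +
  rnorm(n, sd = 0.0005)

toy <- data.frame(sugar, alcohol, density)

# Pearson correlation matrix of all numeric columns
cm <- cor(toy)
print(round(cm, 2))
```

Each off-diagonal cell holds the correlation for one pair of variables, so the strongest positive and negative pairs can be picked out by scanning the matrix.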
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654 7.9 0.330 0.28 31.6
## 1664 1664 7.9 0.330 0.28 31.6
## 2782 2782 7.8 0.965 0.60 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1654 0.053 35 176 1.01030 3.15
## 1664 0.053 35 176 1.01030 3.15
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality fixed.acid.log volatile.acid.log
## 1654 0.38 8.8 6 0.8976271 -0.48148606
## 1664 0.38 8.8 6 0.8976271 -0.48148606
## 2782 0.69 11.7 6 0.8920946 -0.01547269
## chlorides.log free.sulfur.dioxide.log sugar.log
## 1654 -1.275724 1.544068 1.499687
## 1664 -1.275724 1.544068 1.499687
## 2782 -1.130768 0.903090 1.818226
Three observations have densities greater than 1.01. These are probably responsible for the long rightward tail on the density histogram that a log transformation could not correct. They could be influencing the high correlations observed above, so I’ll remove them from subsequent plots and analyses.
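One way to drop such rows is `subset()`. The sketch below is an assumption about how the filtered data frame could be built; the name `wine.subset` appears in the output later in this report, but the exact filtering call is not shown. The toy rows reuse rounded density and residual sugar values printed earlier in this report.

```r
# A few rows with density and residual sugar values taken (rounded) from
# the output above; the last three are the high-density observations
wine <- data.frame(
  density        = c(1.001, 0.994, 0.995, 0.996, 1.0103, 1.0103, 1.03898),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 31.6, 31.6, 65.8)
)

# Keep only observations at or below the density cutoff of 1.01
wine.subset <- subset(wine, density <= 1.01)

# Number of rows removed by the filter
removed <- nrow(wine) - nrow(wine.subset)
print(removed)
```

Keeping the original `wine` data frame intact and working from a filtered copy makes it easy to rerun a plot with and without the outliers.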
## [1] 0.8320888
There’s a positive correlation between residual sugar and density, which makes sense as more sugar would make a wine denser.
A scatterplot using the log transformed residual sugar variable reveals the bimodal distribution. Low sugar wines have a homogeneous dispersal with regard to density while high sugar wines have a positive linear relationship with density. Other variables may be influencing density in low sugar wines, resulting in a lack of a trend, while in high sugar wines, sugar is the main factor influencing density.
## [1] 0.7665352
The correlation between density and residual sugar drops from about 0.83 to about 0.77 when the log transformed sugar variable is used. With sugar falling into two groups that behave differently, a single linear correlation is not the best way to summarize the relationship between these two variables.
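The Pearson correlation changes under a log transform because it measures linear association. A rank-based alternative, Spearman's correlation, is unaffected by any monotonic transform. Spearman's correlation was not used in the original analysis; the sketch below, on synthetic data, only illustrates the difference.

```r
set.seed(11)

# Synthetic right-skewed predictor and a response that depends on it
x <- rlnorm(300, meanlog = 1, sdlog = 0.9)
y <- 0.99 + 0.0005 * x + rnorm(300, sd = 0.001)

# Pearson correlation differs between the raw and log10 versions of x
pearson.raw <- cor(x, y)
pearson.log <- cor(log10(x), y)

# Spearman correlation uses ranks, and log10 does not change the ranks
spearman.raw <- cor(x, y, method = "spearman")
spearman.log <- cor(log10(x), y, method = "spearman")

print(c(pearson.raw = pearson.raw, pearson.log = pearson.log))
print(c(spearman.raw = spearman.raw, spearman.log = spearman.log))
```

Comparing the two measures can help show whether a drop in correlation after transforming reflects a change in linearity rather than a weaker underlying association.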
## [1] -0.8041518
A negative correlation is seen between alcohol and density, again making sense as sugar gets converted into alcohol during fermentation. If sugar directly contributes to density, then density will decrease as the sugar is consumed by yeast.
## [1] -0.4591654
I would expect alcohol and residual sugar to correlate since sugar is converted into alcohol during fermentation, and both variables have correlations with density of around 0.8 in magnitude. There is a negative correlation between the two variables, but at about -0.46 it is much weaker than that. The bimodal distribution in sugar complicates the comparison.
## [1] -0.3937291
Using the log transformed sugar variable reduces the strength of the correlation.
## [1] 0.2891867
Citric acid has a correlation of around 0.29 with fixed acidity. Squaring that value gives about 0.08, so in a simple linear sense citric acid accounts for only about 8% of the variation in fixed acidity. Apart from the negative correlation with pH noted above, nothing else correlates as highly with fixed acidity on the correlation matrix.
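For a simple linear relationship, the share of variance two variables have in common is the square of their correlation coefficient, not the coefficient itself. A minimal sketch follows: the value of `r` is the correlation reported above, while the synthetic `x` and `y` are assumptions used only to check that the squared correlation matches the R-squared from `lm()`.

```r
# Correlation between citric acid and fixed acidity reported above
r <- 0.2891867

# Share of variance explained in a one-predictor linear fit
r.squared <- r^2
print(round(100 * r.squared, 1))  # percent of variance

# Check on synthetic data that cor()^2 equals the R-squared from lm()
set.seed(7)
x <- rnorm(500)
y <- 0.3 * x + rnorm(500)
fit <- lm(y ~ x)

print(all.equal(cor(x, y)^2, summary(fit)$r.squared))
```

The same conversion applies to the other correlations in this section: a correlation of about 0.8 corresponds to roughly 64% of variance, and one of about 0.5 to roughly 25%.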
## [1] 0.2925708
While log transforming fixed acidity improves the distribution of the fixed acidity histogram, the transformation does not contribute much to the correlation with citric acid.
## [1] 0.5424823
## [1] 0.2602505
In addition to residual sugar, total sulfur dioxide and chlorides also correlate positively with the density of a wine.
## [1] 0.400329
The log transformed chlorides variable shows a higher correlation than the base variable.
## [1] -0.5012869
There’s a negative trend where the more chlorides a wine has, the less alcohol it has as well.
## [1] 0.1469724
There’s no obvious relationship between chlorides and sugar.
## [1] -0.4487925
As with chlorides, there’s a negative trend where the more total sulfur dioxide a wine has, the less alcohol it has.
## [1] 0.4182469
The more sugar a wine has, the more total sulfur dioxide it also has.
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
There seems to be a strong trend towards higher alcohol wines receiving higher ratings. This pattern is reversed for the first three rankings, with quality increasing with decreasing alcohol. Other factors may affect ratings at the first three ranks, but then alcohol becomes a driver of quality.
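Per-rank summaries in this format come from `by()`, as the `wine.subset$quality` headers in the output indicate. A minimal sketch on a toy data frame follows; the nine rows are made up, and the extra `tapply()` call is an addition for pulling out just the medians.

```r
# Toy data frame standing in for wine.subset (made-up rows)
wine.subset <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7)),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Five-number summary plus mean of alcohol within each quality rank
print(by(wine.subset$alcohol, wine.subset$quality, summary))

# Just the medians, as a named vector indexed by quality rank
med <- tapply(wine.subset$alcohol, wine.subset$quality, median)
print(med)
```

Extracting the medians into a vector makes it easier to check whether they rise or fall from one rank to the next.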
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.392 9.900 26.050
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1549 0.2007 0.6628 0.6233 1.0293 1.2095
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1549 0.1139 0.3979 0.4872 0.8513 1.2443
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2553 0.8451 0.7036 1.0607 1.3711
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1549 0.2304 0.7243 0.6457 0.9956 1.4158
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.04576 0.23045 0.56225 0.56820 0.86480 1.28443
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.09691 0.32222 0.63347 0.62019 0.91378 1.17026
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2041 0.3010 0.3424 0.4992 0.6232 1.0253
The ranges for residual sugar overlap with each other at each quality level, and plotting the log of residual sugar doesn’t reveal any new patterns.
## subset(wine.subset, sugar.log < 0.5)$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.188 1.475 1.562 1.700 2.900
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.125 1.350 1.473 1.700 3.000
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.200 1.400 1.533 1.800 3.150
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.200 1.500 1.644 1.900 3.100
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.400 1.600 1.783 2.200 3.100
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.400 1.800 1.801 2.125 2.900
## --------------------------------------------------------
## subset(wine.subset, sugar.log < 0.5)$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 1.800 2.000 1.933 2.100 2.200
## subset(wine.subset, sugar.log > 0.5)$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.500 4.975 10.050 9.613 12.100 16.200
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 5.100 7.400 8.153 10.600 17.550
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.20 7.00 9.35 10.21 13.10 23.50
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 6.200 8.500 9.369 12.400 26.050
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 4.800 6.900 8.142 10.900 19.250
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 4.700 7.100 8.131 10.900 14.800
## --------------------------------------------------------
## subset(wine.subset, sugar.log > 0.5)$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.2 5.8 7.4 7.4 9.0 10.6
The bimodal distribution of sugar is probably concealing any patterns in the boxplots. I split the data at a log10 residual sugar value of 0.5 (about 3.2 g/L), as that is where the two groups separate on a scatterplot. There is a trend of increasing mean residual sugar with increasing quality for the wines below the split, at least from rank 4 upward. There doesn’t appear to be a pattern for the wines above it.
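The subsets in the output above are defined on `sugar.log`, so the cutoff of 0.5 is on the log10 scale. The sketch below converts it back to g/L and applies the same split to a toy data frame; the six sugar values and the names `threshold.log`, `threshold.gl`, `low`, and `high` are assumptions for illustration.

```r
# Cutoff on the log10 scale and its equivalent in g/L
threshold.log <- 0.5
threshold.gl <- 10^threshold.log
print(threshold.gl)

# Toy residual sugar values (g/L) on both sides of the cutoff
wine.subset <- data.frame(residual.sugar = c(0.6, 1.6, 3.15, 3.2, 8.5, 20.7))
wine.subset$sugar.log <- log10(wine.subset$residual.sugar)

# Same split as in the output above: below and above a log10 value of 0.5
low <- subset(wine.subset, sugar.log < threshold.log)
high <- subset(wine.subset, sugar.log > threshold.log)

print(range(low$residual.sugar))
print(range(high$residual.sugar))
```

The maximum of the low group and the minimum of the high group bracket the cutoff in g/L, which matches the ranges seen in the two sets of summaries above.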
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.575 7.300 7.600 8.525 11.800
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800 6.400 6.900 7.129 7.600 10.200
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 6.400 6.800 6.934 7.400 10.300
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.836 7.300 14.200
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.200 6.700 6.735 7.200 9.200
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.800 6.657 7.300 8.200
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.60 6.90 7.10 7.42 7.40 9.10
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6232 0.8177 0.8632 0.8702 0.9307 1.0719
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6812 0.8062 0.8388 0.8483 0.8808 1.0086
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6532 0.8062 0.8325 0.8379 0.8692 1.0128
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5798 0.7993 0.8325 0.8316 0.8633 1.1523
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6232 0.7924 0.8261 0.8256 0.8573 0.9638
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5911 0.7924 0.8325 0.8198 0.8633 0.9138
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8195 0.8388 0.8513 0.8676 0.8692 0.9590
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2602 0.3000 0.7850
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7696 -0.6244 -0.5850 -0.5100 -0.3864 -0.1938
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.95861 -0.56864 -0.49485 -0.45775 -0.33734 0.04139
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.61979 -0.55284 -0.54145 -0.46852 -0.04335
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0969 -0.6990 -0.6021 -0.6067 -0.5229 -0.1051
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0969 -0.7212 -0.6021 -0.6061 -0.4949 -0.1192
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9208 -0.6990 -0.5850 -0.5878 -0.4815 -0.1805
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6198 -0.5850 -0.5686 -0.5322 -0.4437 -0.4437
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
The three acidity measures have no apparent relationship to quality.
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.0360 0.0430 0.0452 0.0490 0.2550
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6576 -1.4410 -1.3872 -1.3336 -1.2678 -0.6126
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8861 -1.4202 -1.3372 -1.3318 -1.2676 -0.5376
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0458 -1.3979 -1.3279 -1.3187 -1.2757 -0.4609
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8239 -1.4437 -1.3665 -1.3709 -1.3098 -0.5935
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9208 -1.5086 -1.4318 -1.4335 -1.3565 -0.8697
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.8539 -1.5229 -1.4437 -1.4364 -1.3565 -0.9172
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.745 -1.678 -1.509 -1.576 -1.495 -1.456
The means for chloride levels trend downward with increasing quality. The amount of salt in a wine could directly influence a taster’s rating, or this pattern could just be from chlorides’ positive correlation with density and alcohol’s negative correlation with density.
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.25 33.50 53.33 47.50 289.00
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 18.00 23.36 30.50 138.50
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 22.00 35.00 36.43 50.00 131.00
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 24.00 34.00 35.66 46.00 112.00
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.13 41.00 108.00
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 28.00 35.00 36.72 44.50 105.00
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 27.0 28.0 33.4 31.0 57.0
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 105.8 159.5 170.6 210.0 440.0
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 85.0 117.0 125.3 171.5 272.0
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 121.0 151.0 150.9 182.0 344.0
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18 107 132 137 164 294
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.1 144.2 229.0
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 102.5 122.0 126.2 150.0 212.5
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85 113 119 116 124 139
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2800 0.3800 0.4400 0.4745 0.5425 0.7400
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.3800 0.4700 0.4761 0.5400 0.8700
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2700 0.4200 0.4700 0.4822 0.5300 0.8800
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4800 0.4911 0.5500 1.0600
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4800 0.5031 0.5800 1.0800
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.3800 0.4600 0.4862 0.5850 0.9500
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.360 0.420 0.460 0.466 0.480 0.610
Amounts of sulfur dioxide and sulphate have no apparent relationship with quality.
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
There might be an increase in quality with increasing pH. However, there are so few wines at rank 9 that the mean for that rank is questionable when compared to the other ranks.
## wine.subset$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0001
## --------------------------------------------------------
## wine.subset$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0004
## --------------------------------------------------------
## wine.subset$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
## --------------------------------------------------------
## wine.subset$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9939 0.9959 1.0030
## --------------------------------------------------------
## wine.subset$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0004
## --------------------------------------------------------
## wine.subset$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
## --------------------------------------------------------
## wine.subset$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9897 0.9898 0.9903 0.9915 0.9906 0.9970
The boxplots for density roughly mirror those for alcohol, with density decreasing from quality ranks 5 to 9. There is an increase in density from rank 4 to 5, matching the decrease in alcohol from 4 to 5 on those boxplots. This should all be expected due to the negative correlation between alcohol and density.
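The per-rank summaries above look like the output of base R's `by()`. A minimal sketch of how such a table could be produced is shown here. It assumes a data frame named `wine.subset` with `quality` and `density` columns, and it uses synthetic values as a stand-in for the real data, so the numbers it prints are not the ones reported above.

```r
# Synthetic stand-in for the wine data; values are invented for illustration only.
set.seed(1)
wine.subset <- data.frame(
  quality = sample(3:9, 200, replace = TRUE),
  density = round(rnorm(200, mean = 0.994, sd = 0.003), 4)
)

# Six-number summary of density within each quality rank
density.by.quality <- by(wine.subset$density, wine.subset$quality, summary)
print(density.by.quality)

# Median density per rank as a named vector, convenient for comparing ranks
median.density <- tapply(wine.subset$density, wine.subset$quality, median)
print(median.density)
```

Swapping `density` for `pH` or `alcohol` would give the matching summaries for those variables.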
Some of the features I thought would correlate with quality (acidity and sulfur) did not. Chlorides did show a trend, with the mean amount of salt in a wine decreasing with increasing quality. The strongest pattern, though, came from alcohol, which appears to be the main driver of a taster’s rating. The more alcohol a wine has, the higher its rating.
Residual sugar, total sulfur dioxide, and chlorides positively correlate with density, as expected, since adding solutes increases the density of a solution. Alcohol negatively correlates with density, again as expected, since sugar is converted into alcohol during fermentation.
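The direction of these correlations can be checked with `cor()`. The sketch below builds a toy data set in which density rises with sugar and falls with alcohol; the coefficients and the data frame name `toy` are invented for illustration and are not estimates from the wine data.

```r
# Toy data: density is constructed to rise with sugar and fall with alcohol.
# All coefficients are made up; they are not fitted to the wine data set.
set.seed(42)
n <- 500
residual.sugar <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.99 + 0.0004 * residual.sugar -
  0.001 * (alcohol - 10.5) + rnorm(n, sd = 0.0005)
toy <- data.frame(residual.sugar, alcohol, density)

# Pearson correlation of every column with density
cors <- cor(toy)[, "density"]
print(round(cors, 2))

# Formal test of the alcohol-density relationship
print(cor.test(toy$alcohol, toy$density))
```

On the real data, the same two calls applied to the wine data frame would show whether the signs match the pattern described above.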
Quality ranks 5 through 9 each show increasing levels of alcohol. Alcohol appears to be the main predictor of a wine taster’s rating.
Higher quality wines have a lower density regardless of sugar level.
High quality wines concentrate in the high alcohol, low density levels.
Both low and high sugar wines have higher quality ratings in the higher alcohol ranges.
As quality level increases, the majority of wines in each rank shift to the right toward higher alcohol levels. There’s a distinct split between high and low sugar wines, but these groups don’t shift with quality the way alcohol does.
High quality wines have low amounts of chlorides, low densities, and high alcohol levels.
The split in the residual sugar observations happens around 0.5 g/L. Dividing the alcohol by chlorides plots along this line doesn’t reveal any new details.
High quality wines appear to have less total sulfur dioxide than low quality wines.
High quality wines have low density and low total sulfur dioxide levels.
High quality wines have high alcohol and low total sulfur dioxide levels.
Higher quality wines cluster in the lower left-hand corner, where both chlorides and total sulfur dioxide are low.
The highest quality wines tend to have high levels of alcohol, low densities, and low chloride and total sulfur dioxide levels independent of sugar level. Sugar, chloride, and sulfur dioxide all contribute to density. Since high quality wines tend to have low densities independent of sugar, density is probably a proxy for chloride and sulfur dioxide levels.
High quality wines tend to have a low density even when residual sugar levels are high. Since a scatterplot of residual sugar versus density showed a high correlation between those two variables, other variables that increase density, such as chlorides and sulfur dioxide, must be lower in high sugar wines of high quality. Plots of residual sugar versus chlorides and residual sugar versus total sulfur dioxide with points colored for quality showed this, with high sugar, high quality wines being on the low end of the chloride and total sulfur dioxide scales.
The median percentage of alcohol in a wine increases with quality across the middle three grades, which hold the most observations: grades 5, 6, and 7 have 1457, 2198, and 880 observations respectively. Alcohol percentage decreases from grade 3 to grade 5; however, grade 3 has only 20 observations and grade 4 has 163. With so few observations compared to the middle three grades, these medians could be skewed by an overrepresentation of higher alcohol wines. Likewise, even though grades 8 and 9 fit the pattern of increasing alcohol with increasing quality, the small number of observations in each (175 and 5 respectively) means conclusions about them should be viewed with caution.
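To put the grade sizes in perspective, the counts quoted above can be turned into shares of the whole data set. This is plain arithmetic on the numbers in the text; the variable names are only illustrative.

```r
# Observations per quality grade, as quoted in the text
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

total <- sum(counts)                      # 4898 wines in all
share <- round(100 * counts / total, 1)   # percentage of wines in each grade
print(share)

# Fraction of all wines that fall in the middle three grades (about 92.6%)
middle.share <- sum(counts[c("5", "6", "7")]) / total
print(middle.share)
```

Grades 3 and 9 together hold 25 of the 4898 wines, which is why their medians carry so little weight.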
There is a bimodal distribution for residual sugar in the white wine data set. This probably reflects the distinction between sweet and dry wines. High alcohol wines tend to be ranked as high quality by wine tasters regardless of sugar level. In the upper left-hand corner, an island of high quality wines can be seen that are close to the lowest alcohol levels. These are also some of the sweetest wines. It appears that the one exception to high alcohol wines receiving high ratings is high sugar wines.
The highest quality wines also have some of the lowest salt (sodium chloride) and total sulfur dioxide levels. Wine tasters primarily enjoy alcohol while disliking flavors introduced by salt and sulfur dioxide.
The white wine data set contains about 4900 wines and 12 variables. I began with a series of histograms on each variable to get a sense for the shape of the distributions. Many variables were skewed to the right, with long rightward tails. Log transformations gave these variables approximately normal distributions, except for density, where only three observations formed the long rightward tail. I considered these points outliers and removed them from subsequent analyses because there were so few of them. The one exception to the normal distributions was residual sugar, which had a bimodal distribution. This is most likely due to wines traditionally being either dry or sweet. The most interesting feature of the histograms was the gaps in the data for variables such as alcohol or volatile acidity once they were log transformed. The gaps seem most common on the left side of the histograms, in the lower range of the relevant unit. Perhaps this is an artifact of measurement processes that round to a nearest value.
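The effect of a log transformation on a long right tail can be illustrated with a small sketch. It uses synthetic log-normal data rather than any column of the wine data, and `skewness()` is a helper defined here (the moment-based sample skewness), not a function from base R.

```r
# Synthetic right-skewed variable; not taken from the wine data.
set.seed(7)
x <- rlnorm(2000, meanlog = -3, sdlog = 0.4)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew.raw <- skewness(x)          # clearly positive for the raw values
skew.log <- skewness(log10(x))   # close to zero after the transformation
print(c(raw = skew.raw, log10 = skew.log))
```

A histogram of `log10(x)` next to one of `x` would show the same change visually.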
I was most surprised at how clearly alcohol percentage predicted wine quality in a boxplot, while every other variable had little to no difference between the interquartile ranges of the different quality ranks. Wine tasters appear to be biased by alcohol over other factors contributing to flavor. However, the first few ranks contradict this, with alcohol percentage decreasing with increasing quality from ranks 3 to 5. This might be the result of far fewer observations in ranks 3 and 4 (20 and 163 observations respectively) compared to rank 5 (1457 wines). A small number of observations is more likely to be biased by extreme values. For this reason, conclusions about ranks 8 and 9 (175 and 5 observations respectively) should also be viewed with caution.
It was also interesting to see the relationship of several variables with density. Sugar, sodium chloride, and total sulfur dioxide all positively correlated with density while alcohol negatively correlated. The fact that solutes contribute to the density of a solution and sugar is converted into alcohol during fermentation is visible in the data set.
The largest issues I had were figuring out how to handle the bimodal distribution of residual sugar and investigating the other variables influencing density. I split the data set in half for two analyses involving residual sugar, looking at boxplots of sugar versus quality for wines with < 0.5 g/L of residual sugar and > 0.5 g/L, as well as scatterplots of alcohol versus chloride content split by sugar level. There’s a pattern of dry wines receiving a higher rating with higher sugar content (so the wine tasters don’t want their dry wines too dry?). No pattern was observed for sweet wines, but they have so much sugar to begin with that differences in residual sugar levels may not be enough to affect quality scores. It makes sense that level of sugar would have a larger impact in dry wines.
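One way to split a data set at a sugar cutoff is with `cut()`. The sketch below is an assumption about how such a split could be coded, not the code used for this report: the data frame `toy`, the column `sugar.group`, and the synthetic values are all hypothetical, and `sugar.cutoff` simply reuses the 0.5 figure quoted above on whatever scale that figure was read from.

```r
# Synthetic data with two clearly separated sugar groups; values are invented.
set.seed(3)
toy <- data.frame(
  residual.sugar = c(runif(50, 0.1, 0.4), runif(50, 0.6, 2.0)),
  quality = sample(3:9, 100, replace = TRUE)
)

# Cutoff taken from the text; adjust it to the scale used on the real plot.
sugar.cutoff <- 0.5

# Label each wine as low or high sugar relative to the cutoff
toy$sugar.group <- cut(
  toy$residual.sugar,
  breaks = c(-Inf, sugar.cutoff, Inf),
  labels = c("low sugar", "high sugar")
)
group.sizes <- table(toy$sugar.group)
print(group.sizes)

# Median residual sugar for each combination of sugar group and quality rank
group.medians <- aggregate(residual.sugar ~ sugar.group + quality,
                           data = toy, FUN = median)
print(head(group.medians))
```

Each group can then be passed separately to the boxplot or scatterplot code.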
Chlorides and sulfur dioxide are trickier to interpret. High sugar wines receive a higher score when they have a lower density. Chlorides and sulfur dioxide contribute to density, implying that higher chloride and sulfur dioxide levels reduce a wine’s score. However, while high chloride and sulfur dioxide wines have lower scores, they also have less alcohol.
I would like to examine the companion red wine data set to test how it compares. More observations for the lowest and highest quality wines would also be extremely helpful to test if the pattern of rising quality with rising alcohol percentage holds. The red wine data set has fewer observations than this one, so combining the two wouldn’t necessarily address this issue.